Bridging the Search Gap between the Web of Pages and Web of Data by Combining Ontological Document Expansion with Text Search
Abstract
The Semantic Web extends traditional web documents, i.e. the Web of Pages, with conceptual structures based on ontologies and metadata, i.e. the Web of Data. This paper presents a hybrid document search approach that combines the benefits of traditional text search over literal documents with those of semantic search based on their underlying conceptual structures. The approach is based on document expansion, where documents are automatically annotated not only with the concepts explicitly present in a given document, but also with ontologically related concepts using smaller weights. Our test results using the CLEF Test Suite suggest that document expansion alone achieves better recall than text search at the expense of precision. As a solution, a method of combining document expansion with text search is presented, in which better recall was obtained without sacrificing precision. This approach seems promising when integrating unstructured, textual content with the Semantic Web of Data.

1 Text Search vs. Semantic Search

The Semantic Web¹ extends web pages, documents, and other web materials with a machine-understandable, ontology-based [21] network of metadata attached to the contents. In the case of text documents, a central part of the metadata describes the subject matter of the text and is directly related to the literal words and expressions of the documents. When searching text documents, two major approaches are available:

1. In text search the query is matched against the textual expressions of the documents, and search takes place in a literal space. In spite of the success of traditional text search engines, this approach has severe fundamental limitations [5] in its basic form. For example, recall is lowered because the query word cannot be matched with synonyms or semantically close terms.
For example, the query “student” does not match documents about pupils, “bird” does not match descriptions of eagles, and the query “nordic country” does not match Finland or Sweden. At the same time, precision is lowered by the polysemy and homonymy of words. For example, the query “bank” matches financial banks, blood banks, and river banks. It is often possible to improve precision at the expense of recall, and vice versa [4].

¹ http://www.w3.org/2001/sw/

2. Semantic search [8] tries to address the limitations of text search by performing search in a conceptual space, based on disambiguated concepts rather than literal words, and by utilizing the semantic networks of concepts underlying the texts. Such a search should be more precise, since homonymous queries can be disambiguated before the query is run, or the results can be clustered afterwards according to the different interpretations of the query. At the same time, recall can be improved by extending the query to synonyms and semantically related ontological concepts through query expansion [15, 11].

Both of these methods can be further refined in the WWW environment by using the links between pages to find relevant documents not included in the original result set and to rank the results so that they better reflect their authoritativeness [16].

Automated query expansion methods can be broken down into 1) methods based on search results and 2) methods based on knowledge structures, the latter of which can be further grouped into collection dependent and collection independent methods [4]. Methods based on search results first perform a query using the query terms as given by the user, after which a new query is formed from terms that occur frequently in the result set. Methods based on knowledge structures use either corpus-based knowledge, for example correlations between different terms, or some a priori knowledge, such as relations between different concepts.
This latter approach lends itself well to document expansion, where the expansion is not done dynamically in response to a user query but rather in advance during indexing.

Unfortunately, semantic search is not a panacea but has its own difficulties and limitations. For example, expanding the query or the documents semantically raises recall but may dramatically lower precision unless the expansion strategy is carefully tuned [11]. On the other hand, matching precise search concepts with conceptual metadata may lower recall because conceptual representations cannot model very well, e.g., the uncertain or fuzzy meaning of real world concepts [9]. Furthermore, the research tradition of information retrieval [1] has produced many useful methods and techniques, such as TF-IDF [19], for ordering search results according to their relevance w.r.t. the query. Oddly enough, the issue of relevance has not yet been discussed much in the semantic search community, although it has been a key issue behind the success of search systems such as Google.

It therefore seems worthwhile to investigate whether it is useful to combine ideas from text search with those of semantic search, as suggested e.g. in [10], for an optimal hybrid search strategy. This paper presents such a study in the application domain of newspaper articles. In the following, we present a document expansion method utilizing ontologies, by which text documents can be annotated automatically and different semantic search strategies can be performed. We then test some combinations of semantic search and text search and measure the results using an article data set of the CLEF Test Suite² and its gold standard. The results of our experiments suggest that substantial benefits in terms of precision, recall, and relevance can be obtained by combining methods of text and semantic search in smart ways.
² http://www.clef-campaign.org/

2 A Hybrid Search and Recommendation Architecture

2.1 Document expansion versus query expansion in the Semantic Web domain

The difference between document expansion and query expansion is basically the timing of the expansion step. In document expansion the terms are expanded during the indexing phase for each individual document. In query expansion only the query terms are expanded, and this is done dynamically when a user performs a search in the database.

Document expansion, though less frequently used than query expansion, has some important benefits when used in the Semantic Web domain. First of all, it does not put a computational strain on the search system, as it is performed in advance during the indexing stage. Also, when done using ontologies, it expands the documents or other resources to the Semantic Web of Data, allowing these documents and resources to be linked together in new ways. If the ontological expansion is done in the query phase, it expands only the query and not the actual documents and resources. This means that in order to link resources through more complex relations the links are made longer, which, in the case of larger ontologies, can expand a query originally comprised of a few terms into thousands of terms. An example of this would be a user searching a database for all documents that are about birds. Using a bird name ontology like AVIO³ with 9740 individual species would find all the documents mentioning any individual bird species even if they lack the original query term, but the query itself would be almost ten thousand terms long. Most search engines have limits on the size of the input, which renders queries of thousands of terms impossible.

2.2 Ontological Concept Clustering

The hybrid search architecture and process used in our case study, based on text search and semantic search using ontology based document expansion, is depicted in Figure 1.
The process starts with the lemmatization of a given document (cf. the upper left corner of the figure). After this, stop words are filtered out and the text is indexed into a conventional term-document matrix [20]. In order to facilitate semantic search, each lemmatized term in a document is matched with ontological concepts using the labels present in an ontology (cf. box “concept matching” in the figure). If a match is found, the concept’s URI is added to the document’s metadata as a subject annotation. Several ontologies can be used during indexing, and the concepts found from each ontology are saved in their own subject fields. The only characteristics required of an ontology are the existence of concept labels and some sort of relations between concepts, so in theory thesauri can also be used, though their typically more limited relations would result in worse performance overall. A separate index is built for each ontology/field. The process is fully automatic since we are dealing with thousands of newspaper articles, which makes manual checking infeasible. Homonymous terms are not disambiguated, either by human intervention or by other semantic techniques, but are indexed under each of their meanings.

³ http://www.yso.fi/onki/avio/?l=en

Fig. 1. The process of document expansion through ontological concept clustering

The document expansion is performed in the box “concept clustering” by expanding each matched concept c into a larger set that consists of other concepts semantically related to c in the ontology. The goal of using document expansion is to provide the user with larger, relevant result sets by using the semantic information underlying the documents. Document expansion is done by following an ontology specific pattern expressed in a pattern language developed for the task. A pattern is comprised of paths made up of the relations in the target ontology.
Each path specifies the relations, or steps, that make up the path, the depth to which those relations are to be followed, as well as a weighting coefficient which determines the importance of the related concepts found using the path. The idea of giving a weight to related concepts is an extension to earlier query expansion patterns such as [11]. For example, the direct superclasses of c can be given a high weight value, and the subclasses a smaller weight. The pattern language used is presented in more detail in section 2.3.

Fig. 2. The ontological expansion of a single concept

Figure 2 shows an example of ontologically expanding a single concept. The vertical arrows denote subclass relations while the horizontal arrows show related concept relations. When the concept ’locomotives’ is found in the text, other ontologically related concepts are given weights (shown in black in the figure) depending on the nature of the relation.

The result of the document expansion is a cluster (set) of concepts annotating the subject of the document. The concepts literally present in the document have a weight of one, and the semantically related concepts typically have smaller values between zero and one. In practice the weights of related concepts should be kept low and balanced so that they summarize the subject matter in a semantically correct and balanced way. For the final cluster, the weights are multiplied by the square root of the frequency of occurrence of the original concept. This means that a concept gains more weight when it has a relation to several different concepts that are present in the document. The use of a balancing function (square root) in this step is needed in order to keep a single concept with a high occurrence frequency from dragging its whole cluster up too high in the final index. Instead of the square root some other balancing function could be used, which has some impact on the results.
Finding the optimal balancing function is a whole new problem and is disregarded here.

After the concept clustering step has been performed for every concept found in the document, the clusters are added together and the weights are rounded to the nearest integer. The rounding is done because the URI of each concept is added to its respective ontological index as many times as the rounded sum of its weights indicates, and this in turn allows the use of TF-IDF balancing in the concept index. This keeps the highly connected ontological concepts from dominating the search results.

When a query is issued into the system (cf. lower left corner in Figure 1), it is lemmatized and directed to the text search index as normal. On the semantic search side, ontological concept matching is performed (in the same vein as when indexing the contents) and the resolved concepts are used as queries into the ontological concept indices created from the expanded documents. The result sets of all queries, based on text and semantic search, can then be combined in several different ways to produce different outcomes for the end user. Our research question is whether this can be done in a way that provides better search results than text search or semantic search alone. The choices and evaluation results of some strategies will be presented in the evaluation section (see 3.2).

2.3 Pattern Language

A crucial part of ontological document expansion is the pattern which defines the ontological relations that are to be followed when constructing a cluster around a given concept. A pattern is comprised of paths made up of hierarchical and associative relations in a given ontology. It is ontology-specific and should be tailored to a specific database in order to take full advantage of the proposed method, as different domains place varying emphasis on different relations. Because of this, patterns should be easy to construct when configuring the system for new applications.
An XML-based pattern language was developed with this in mind. The basic layout of a pattern is as follows:

- A pattern is comprised of one or several paths
- A path is comprised of one or several relations or steps
- Each path includes a weight which is applied to the resources at the end of the path

Each step of the path includes a relation and information on whether it should be traversed towards the object or the subject of the triple. This has to be specified because triples are directed and not all relations have an inverse relation defined, yet it can still be useful to traverse a relation in the inverse direction. An example of this is the property rdfs:subClassOf, which is used to build the class hierarchy of ontologies. Its inverse, i.e. the superClassOf relation, is not normally defined, yet it is often interesting to traverse the hierarchy towards subclasses, too.

Aside from these obligatory definitions, the pattern language includes a number of definitions for ease of use. The first one is depth, which determines how many times a given step is to be performed before proceeding to the next step. Another is inclusiveness, which determines whether the weight is to be applied to every concept along the path or just to the final set at the end of the last step.

Relations of path                  Depth   Weight   Is inclusive
subClassOf (s), subClassOf (o)     1,1     0.05     false
associativeRelation (s)            1       0.2      true
subClassOf (s)                     1       0.05     true
subClassOf (o)                     1       0.1      true

Table 1. The clustering pattern used for the evaluation

An example pattern is depicted in Table 1. Each row in the table describes one path. The first column shows the relations that make up the path, marked with either (s) or (o) depending on whether the relation is to be followed starting from the subject or the object of the triple. From the table we can see, for example in the last two paths, that a higher weight is given to the subclasses of a given concept than to its superclasses.
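The pattern of Table 1 and the clustering step of section 2.2 can be sketched together in a few lines of Python. This is only an illustration under stated assumptions, not the Airo implementation: the toy triples and concept names are hypothetical, the function names (follow, expand, build_cluster) are invented here, and keeping the largest weight when several paths reach the same concept is an assumption, since the text does not say how such cases are resolved.

```python
import math

# Toy ontology as (subject, relation, object) triples. All concept names here
# are hypothetical and are not taken from YSO.
TRIPLES = [
    ("steam_locomotives", "subClassOf", "locomotives"),
    ("locomotives", "subClassOf", "rail_vehicles"),
    ("wagons", "subClassOf", "rail_vehicles"),
    ("locomotives", "associativeRelation", "railways"),
]

# The four paths of Table 1. Each path is (steps, weight, inclusive) and each
# step is (relation, direction, depth). Direction "s" starts from the subject
# of a triple and moves to its object; "o" moves from object to subject.
PATTERN = [
    ([("subClassOf", "s", 1), ("subClassOf", "o", 1)], 0.05, False),
    ([("associativeRelation", "s", 1)], 0.2, True),
    ([("subClassOf", "s", 1)], 0.05, True),
    ([("subClassOf", "o", 1)], 0.1, True),
]


def follow(concepts, relation, direction):
    """Take one step along `relation` from every concept in `concepts`."""
    reached = set()
    for subj, rel, obj in TRIPLES:
        if rel != relation:
            continue
        if direction == "s" and subj in concepts:
            reached.add(obj)
        elif direction == "o" and obj in concepts:
            reached.add(subj)
    return reached


def expand(concept, pattern=PATTERN):
    """Return {concept: weight} for one matched concept and its neighbours."""
    weights = {concept: 1.0}  # concepts literally present get weight one
    for steps, weight, inclusive in pattern:
        frontier, along_path = {concept}, set()
        for relation, direction, depth in steps:
            for _ in range(depth):
                frontier = follow(frontier, relation, direction)
                along_path |= frontier
        # Inclusive paths weight every concept met; others only the final set.
        targets = along_path if inclusive else frontier
        for related in targets - {concept}:
            # Assumption: if several paths reach a concept, keep the largest weight.
            weights[related] = max(weights.get(related, 0.0), weight)
    return weights


def build_cluster(concept_counts, pattern=PATTERN):
    """Sum the per-concept clusters, scaled by sqrt(frequency), then round."""
    totals = {}
    for concept, frequency in concept_counts.items():
        for related, weight in expand(concept, pattern).items():
            totals[related] = totals.get(related, 0.0) + weight * math.sqrt(frequency)
    # Round half up to the nearest integer; drop concepts that round to zero.
    rounded = {c: math.floor(t + 0.5) for c, t in totals.items()}
    return {c: n for c, n in rounded.items() if n > 0}
```

For a document in which the hypothetical concept "locomotives" occurs nine times and "wagons" four times, build_cluster({"locomotives": 9, "wagons": 4}) yields integer counts that say how many times each concept URI would be written to the concept index. With weights as small as those in Table 1, a related concept only survives the rounding when frequent or multiple source concepts contribute to it.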
Finally, an XML serialization of the pattern language was realized.

2.4 Searching vs. Recommending

An interesting question raised by document expansion is the relation between semantic search and semantic recommendation, which is used as a key component in some semantic portals [13]. The idea of semantic recommendations is to provide the user with additional semantically related hits that are likely to be of interest to her, but that cannot be included in the search result. This is because the connection between the query and the recommendations is not necessarily obvious, and the recommendations could look like wrong hits without further explanation. For example, by using our pattern language it is possible to include in the result set of a query ’Finland’ a document that is related to ’Sweden’, a neighboring country, if the geospatial relation is considered important. The article may then not be connected with the query in terms of subject matter at all.

When expanding a query or a document semantically, the vague borderline between search hits and recommendations is easily crossed, and the actual search results get mixed with the recommendations. In our view, the distinction is useful and clarifying from an end-user’s point of view, as illustrated in systems such as [12, 14]. We therefore decided to investigate hybrid strategies between text search and semantic recommendation in our case study, too.

For this purpose a recommendation scheme was devised that picks a number of the most relevant documents returned by the text search, for example ten. These documents are then searched for the concepts that occur in more than one document, and an intersection of the found concepts is used to form a new query into the concept index of the database. The intuition behind this scheme is that the shared concepts are likely to tell something semantically essential about the query and the underlying document set.
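The scheme above can be sketched as follows. This is a minimal illustration, not the Airo code: the function names are hypothetical, "concepts that occur in more than one document" is read here as concepts annotating at least two of the top documents, and a plain overlap count stands in for the query against the concept index.

```python
from collections import Counter


def shared_concepts(top_docs, doc_concepts, min_docs=2):
    """Concepts that annotate at least `min_docs` of the given documents."""
    counts = Counter()
    for doc in top_docs:
        counts.update(set(doc_concepts.get(doc, ())))
    return {concept for concept, n in counts.items() if n >= min_docs}


def recommend(text_hits, doc_concepts, k=10, limit=10):
    """Rank documents outside the text result set by shared-concept overlap."""
    query = shared_concepts(text_hits[:k], doc_concepts)
    already_found = set(text_hits)
    scored = []
    for doc, concepts in doc_concepts.items():
        # Recommendations complement the text search, so documents that the
        # text search already returned are skipped.
        if doc in already_found:
            continue
        overlap = len(query & set(concepts))
        if overlap:
            scored.append((overlap, doc))
    # Highest overlap first; ties broken by document id for determinism.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [doc for _, doc in scored[:limit]]
```

Here text_hits is the ranked list of document ids from the text search and doc_concepts maps each document id to the set of concept identifiers produced by document expansion. In a real system the overlap count would be replaced by the relevance score of the concept index.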
Further constraints for recommendations are possible, too, based on metadata present in the original result set. For example, a time window can be used so that the recommendation results must fit within a certain time interval based on the temporal metadata of the oldest and the newest document in the original result set. After finding a result set of recommendations, those documents that are present in the original search result set should be removed, because recommendations are by definition complementary to direct search results. This recommendation method therefore provides an entirely separate additional set of documents that are strongly related to the original search query through concepts in the ontology and the actual metadata of the expanded documents.

3 Strategies for Hybrid Search and Recommendation

In order to test the architecture of Figure 1, an application using ontological concept clustering, named Airo⁴, was implemented using a dataset of 8000 articles from the newspaper Helsingin Sanomat⁵. Airo provides an implementation of ontological concept clustering as well as text and semantic search capabilities based on it. Airo was coded in Java, and it uses the Jena framework⁶ for handling RDF(S) ontologies and Lucene⁷ for search and indexing tasks. Automatic annotations of the data set were created using the tool Poka⁸. The General Finnish Ontology (YSO)⁹ with over 20,000 concepts was used as the underlying ontology. YSO is a relatively simple ontology featuring associative, part-of, and subclass relations.

One design goal of Airo was to ensure its scalability to datasets comprised of millions of articles, as is the case with the electronic archives of newspapers. To this end the application needed to be fast, be able to adapt to material that is added daily, and be compatible with arbitrary ontologies.
With the test configuration, the indexing of 8000 articles took about an hour, which means that indexing a million articles could be done in less than a week. In reality the indexing would be done with more powerful computers, which would cut down the time needed considerably. The process is also non-recurring and needs to be repeated only when the ontologies used for indexing are changed or new ones are added. Adding new articles to an existing index is very fast. Also, even though the size of the index is considerably larger, data storage space is not a concern these days when it comes to textual data. With modern search engines like Lucene, the system’s effect on search time is negligible and practically independent of the size of the index.

⁴ http://www.seco.tkk.fi/tools/airo/
⁵ http://www.hs.fi/
⁶ http://jena.sourceforge.net/
⁷ http://lucene.apache.org/
⁸ http://www.seco.tkk.fi/tools/poka/
⁹ http://www.yso.fi/onki/yso/

3.1 Evaluating Search and Recommendation Strategies

In order to evaluate different combinations of text search, semantic search and recommending, an evaluation test was first carried out. For this purpose, the test system of the Cross Language Evaluation Forum (CLEF)¹⁰ was used. The specific version used was ELRA-E00008, The CLEF Test Suite for the CLEF 2000-2003 Campaigns, whose Finnish test set is comprised of articles from the newspaper Aamulehti and search tasks connected to these. The tests were done with all of the 60 search tasks of the year 2003.

The search tasks in the test suite are comprised of a title, which gives the topic of the task, a short description, which defines the task, and a longer narrative. The narrative describes the situation behind the task and the limitations on the kind of articles that are considered relevant to the query. Only the titles were used to construct the queries, since Airo does not include the kind of natural language processing functions needed for parsing search queries from narratives.
The evaluation itself was done by comparing the articles returned by the system for a search task with a relevance file that lists the binary relevance of each article in the database for each query. It is worth noting that the database provided does not include any relevant documents for some of the search tasks. The pattern that was used for concept clustering is the one depicted in Table 1.

3.2 The Test and Results

Five different search strategies were used for each of the search tasks:

1. Text search refers to the traditional search where the lemmatized search terms were queried from the text index.
2. In Concept search the search terms were matched with ontological concepts of the YSO ontology and these were used to query the concept index.
3. Text and concept search combines the previous two queries through Lucene’s Boolean should-operator, which corresponds to a union.
4. Recommendation is comprised of the eleven most relevant articles obtained through the query expansion method described earlier.
5. Smartly combined text search and recommendation means that the fifteen most relevant text search results are listed first, after which the ten most relevant recommendation results are listed, followed by the rest of the text search results. The number fifteen was chosen here arbitrarily as a guess at how many topics a user might scan from the text search results before looking at the recommendations if both are shown next to each other at the same time. User tests would be needed to get an accurate number for this, but its effect is rather minimal as the CLEF Test Suite does not evaluate the order of the results.

¹⁰ http://www.clef-campaign.org/

A maximum of 1000 documents were considered when evaluating the result sets. The recall and precision of the five different search setups are depicted in Figure 3.

Fig. 3. The precision and recall of different search setups

The values of both precision and recall are between zero and one.
The scores of text search should be regarded as the base level against which the others are compared. From the figure it can be seen that the recall of both concept search and text and concept search combined is high but the precision of both is low. This is to be expected because concept search retrieves a much larger number of documents than traditional text search and therefore also returns a large number of the relevant documents. In recommendation, precision is slightly higher and recall somewhat lower than in text search. The lower recall occurs because the maximum number of returned documents was set to eleven, which is lower than the number of articles listed as relevant for some search tasks. A feature worth noting here is that, due to the algorithm used, the result set is completely different from the result set obtained with traditional text search. The effect of this can be seen in the next setup, smartly combined text search and recommendation, where the recall is simply the sum of the recall of text search and that of recommendation. Precision, on the other hand, is the average of the precision of the two component methods.

A straight comparison between the setups that includes all the results returned will not give an accurate idea of the qualities of the setups in the actual intended usage of the system. An end user is typically not interested in hundreds of documents but rather scans the first few dozen results at most. Owing to this, precision with a certain maximum size of result set is a meaningful measure, and the CLEF Test Suite produces it automatically. In practice this measure is calculated just like the precision above, but taking into account only the n most relevant results. If the number of documents returned is less than n, the missing results are presumed wrong, which means that it is impossible to achieve perfect precision if n is larger than the total number of relevant documents for a given query in the database.
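The measure just described can be written down in a few lines. The sketch below is an illustration only and does not reproduce the CLEF tooling; the function names are hypothetical, ranked is a result list ordered by relevance, and relevant is the set of documents marked relevant for the query.

```python
def precision_at_n(ranked, relevant, n):
    """Precision over the n best results; missing results count as wrong."""
    if n <= 0:
        raise ValueError("n must be positive")
    hits = sum(1 for doc in ranked[:n] if doc in relevant)
    # Dividing by n rather than by the number of documents actually returned
    # is what penalizes result lists shorter than n.
    return hits / n


def mean_precision_at_n(runs, n):
    """Average precision at n over (ranked, relevant) pairs, one per task."""
    return sum(precision_at_n(ranked, relevant, n) for ranked, relevant in runs) / len(runs)
```

For instance, a list of three results of which two are relevant scores 2/5 at n = 5, not 2/3, because the two missing positions are treated as wrong answers.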
When an average of the precision over all search tasks is calculated, comparing the different setups with different maximum numbers of returned documents is easy. This is depicted in Figure 4.

Fig. 4. Average precision with a certain maximum size result set

Traditional text search and recommendation have the lowest precision when viewed in this way, while their combination has the highest with a low number of documents. With 15 documents or more, text search combined with concept search is best. The aforementioned method of calculation, where missing documents are considered false, does skew the results, especially with a high maximum number of documents. When the maximum is low, though, the measure accurately simulates a real use case where the end user scans the first 10-30 results offered. This means that the simple combination of text search and concept search, though severely lacking in precision when considered over the whole result set, might still work in real life situations where the user is interested in only a few of the best ranked results. More tests are needed to draw definite conclusions.

3.3 Airo Application

Based on the evaluation, the user interface depicted in Figure 5 was implemented to use the recommendation system detailed above to accompany the traditional text search.

Fig. 5. Airo user interface

Recommendation was chosen as it showed improvements over traditional text search on all of the scales used and was easy to add to the interface in an unobtrusive way that still leaves the text search in place. In Figure 5 the results of the text search are shown on the left, and on the right are the eleven best results obtained from the recommendation algorithm. The query was “Iraq Bush”, restricted to the time period from November 23rd to December 31st, 2005.
The text search results include many at least seemingly relevant article titles, but also some less immediately clear ones like “President Morales hopes for a political peace for Bolivia”. Recommendation also holds seemingly relevant titles, especially at the top, but also less relevant ones like “Franco’s time is still a sore subject in Spain”. The number of recommendation results shown is purely arbitrary and would be simple to change, but finding the ideal number would take some user testing and might depend on the dataset as well as on the ontology. The relatively low number of recommended articles shown hopefully keeps the user from being overwhelmed, and showing them separately lets the user easily ignore them if they so wish.
Publication date: 2009